The Ford GoBike System Dataset contains information about rides in a bike sharing system in the greater San Francisco Bay area.
Preliminary Wrangling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
%config InlineBackend.figure_format='retina'
df = pd.read_csv('201902-fordgobike-tripdata.csv')
Checking and Removing Rows with Missing Data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
df = df.dropna()
df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
| 5 | 1793 | 2019-02-28 23:49:58.6320 | 2019-03-01 00:19:51.7600 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959.0 | Male | No |
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174952 entries, 0 to 183411 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 174952 non-null int64 1 start_time 174952 non-null object 2 end_time 174952 non-null object 3 start_station_id 174952 non-null float64 4 start_station_name 174952 non-null object 5 start_station_latitude 174952 non-null float64 6 start_station_longitude 174952 non-null float64 7 end_station_id 174952 non-null float64 8 end_station_name 174952 non-null object 9 end_station_latitude 174952 non-null float64 10 end_station_longitude 174952 non-null float64 11 bike_id 174952 non-null int64 12 user_type 174952 non-null object 13 member_birth_year 174952 non-null float64 14 member_gender 174952 non-null object 15 bike_share_for_all_trip 174952 non-null object dtypes: float64(7), int64(2), object(7) memory usage: 22.7+ MB
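What is the structure of the dataset?
The dataset is a CSV file with 183412 rows, one per trip, and 16 columns. After removing rows with missing values, 174952 trips remain.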
What is/are the main feature(s) of interest in your dataset?
These are the main features of interest:
| Feature | Description |
|---|---|
| duration_sec | The duration of the bike trip in seconds |
| start_station_id | The id of the bike station where the trip started |
| start_station_latitude | The beginning position's latitude |
| start_station_longitude | The beginning position's longitude |
| end_station_id | The id of the bike station where the trip ended |
| end_station_latitude | The end position's latitude |
| end_station_longitude | The end position's longitude |
| user_type | Shows if the rider is a subscriber or a customer |
| member_birth_year | Shows when the rider was born |
| member_gender | Shows the gender of the rider |
What features in the dataset do you think will help support your investigation into your feature(s) of interest?
All of the features listed above.
Using Feature Engineering to create extra features
Creating a distance feature
The Euclidean distance formula can be used to find the distance between the start and end points of a bike trip. However, it returns the distance in degrees of latitude and longitude rather than in a physical unit, so the values are hard to interpret directly, although they can still be useful for relative comparisons.
def Euclidean_Dist(df1, df2, cols=['x_coord','y_coord']):
    return np.linalg.norm(df1[cols].values - df2[cols].values,
                          axis=1)
df['distance'] = np.nan
df['distance'] = np.sqrt(np.square(df['start_station_longitude'] - df['end_station_longitude']) + np.square(df['start_station_latitude'] - df['end_station_latitude']))
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174952 entries, 0 to 183411 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 174952 non-null int64 1 start_time 174952 non-null object 2 end_time 174952 non-null object 3 start_station_id 174952 non-null float64 4 start_station_name 174952 non-null object 5 start_station_latitude 174952 non-null float64 6 start_station_longitude 174952 non-null float64 7 end_station_id 174952 non-null float64 8 end_station_name 174952 non-null object 9 end_station_latitude 174952 non-null float64 10 end_station_longitude 174952 non-null float64 11 bike_id 174952 non-null int64 12 user_type 174952 non-null object 13 member_birth_year 174952 non-null float64 14 member_gender 174952 non-null object 15 bike_share_for_all_trip 174952 non-null object 16 distance 174952 non-null float64 dtypes: float64(8), int64(2), object(7) memory usage: 24.0+ MB
df['distance'].max()
0.6993993230710549
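To see why a distance expressed in degrees is hard to interpret, the sketch below converts one degree of latitude and one degree of longitude to kilometers. The Earth radius of 6371 km and the reference latitude of 37.8 degrees (roughly San Francisco) are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius in km (assumed)
REF_LATITUDE_DEG = 37.8   # approximate latitude of San Francisco (assumed)

# One degree of latitude has the same length everywhere on a sphere
km_per_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360

# One degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(REF_LATITUDE_DEG))

print(f"1 degree of latitude  ~ {km_per_deg_lat:.1f} km")
print(f"1 degree of longitude ~ {km_per_deg_lon:.1f} km")
```

Because the two axes scale differently, a Euclidean distance in degrees mixes two different length units, which is one reason to prefer a great-circle formula.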
Using the haversine formula, the distance between two coordinate points can be calculated in kilometers or miles (kilometers in this case). More information about the haversine formula can be found here.
The haversine distances are the distances used for the analysis.
from math import cos, asin, sqrt, pi
def distance_calc(lat1, lon1, lat2, lon2):
    r = 6371 # km, 3958.756 for miles
    p = pi / 180 # degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 2 * r * asin(sqrt(a))
distance_calc_generator = np.vectorize(distance_calc)
df['distance'] = distance_calc_generator(df['start_station_latitude'].values,
df['start_station_longitude'].values,
df['end_station_latitude'].values,
df['end_station_longitude'].values)
df['distance']
0 0.544709
2 2.704545
3 0.260739
4 2.409301
5 3.332203
...
183407 1.464766
183408 1.402716
183409 0.379066
183410 0.747282
183411 0.710395
Name: distance, Length: 174952, dtype: float64
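As an alternative to wrapping the scalar function in np.vectorize, the same haversine expression can be written directly with NumPy functions so that it operates on whole arrays at once. The following is a minimal sketch under that assumption; `haversine_km` is a hypothetical name, and the clipping step is an added safeguard that is not part of the original function.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    # Same haversine term as distance_calc, written with NumPy ufuncs
    a = (0.5 - np.cos(lat2 - lat1) / 2
         + np.cos(lat1) * np.cos(lat2) * (1 - np.cos(lon2 - lon1)) / 2)
    # Guard against tiny negative values caused by floating-point rounding
    a = np.clip(a, 0.0, 1.0)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Coordinates of the first trip in the table shown earlier
d = haversine_km(37.789625, -122.400811, 37.794231, -122.402923)
print(round(float(d), 6))

# The function also accepts arrays, e.g. DataFrame columns via .values
lats = np.array([37.789625, 37.772406])
lons = np.array([-122.400811, -122.435650])
print(haversine_km(lats, lons, lats, lons))  # identical points give 0 km
```

For the first trip this should agree with the 0.544709 km reported in the distance column above.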
Creating an age feature
df['age'] = (2024 - df['member_birth_year']).astype(int)
df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | distance | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No | 0.544709 | 40 |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No | 2.704545 | 52 |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No | 0.260739 | 35 |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes | 2.409301 | 50 |
| 5 | 1793 | 2019-02-28 23:49:58.6320 | 2019-03-01 00:19:51.7600 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959.0 | Male | No | 3.332203 | 65 |
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174952 entries, 0 to 183411 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 174952 non-null int64 1 start_time 174952 non-null object 2 end_time 174952 non-null object 3 start_station_id 174952 non-null float64 4 start_station_name 174952 non-null object 5 start_station_latitude 174952 non-null float64 6 start_station_longitude 174952 non-null float64 7 end_station_id 174952 non-null float64 8 end_station_name 174952 non-null object 9 end_station_latitude 174952 non-null float64 10 end_station_longitude 174952 non-null float64 11 bike_id 174952 non-null int64 12 user_type 174952 non-null object 13 member_birth_year 174952 non-null float64 14 member_gender 174952 non-null object 15 bike_share_for_all_trip 174952 non-null object 16 distance 174952 non-null float64 17 age 174952 non-null int32 dtypes: float64(8), int32(1), int64(2), object(7) memory usage: 24.7+ MB
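The age feature above is measured relative to 2024, while the trips themselves took place in February 2019. If the rider's approximate age at the time of the trip is wanted instead, one option is to take the year from start_time. This is a sketch of an alternative and not what the analysis uses; `age_at_trip` is a hypothetical column name, and the two rows are copied from the table above.

```python
import pandas as pd

# Two illustrative rows copied from the table above
trips = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 23:49:58.6320"],
    "member_birth_year": [1984.0, 1959.0],
})

# Parse the timestamp strings, then subtract the birth year from the trip year
trip_year = pd.to_datetime(trips["start_time"]).dt.year
trips["age_at_trip"] = (trip_year - trips["member_birth_year"]).astype(int)
print(trips["age_at_trip"].tolist())
```

Only the birth year is recorded, so either definition of age is accurate to within one year at best.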
Univariate Exploration
Distribution of user_type
plt.figure(figsize=(5,5))
# Use one ordering for both the bars and the percentage labels so they line up
counts = df['user_type'].value_counts()
sns.countplot(df, x='user_type', order=counts.index)
total_count = counts.sum()
for i, count in enumerate(counts):
    text_annotation = f'{100*count/total_count:.1f}%'
    plt.text(i, count, text_annotation, fontsize=10, ha='center', va='bottom')
Distribution of which bikes were used most frequently based on bike_id
sns.histplot(df, x="bike_id");
Distribution of rent duration using duration_sec
Values of duration_sec above 2000 seconds are treated as outliers and left out of the binned plot. There are 4692 such trips.
df['duration_sec'].describe()
count    174952.000000
mean        704.002744
std        1642.204905
min          61.000000
25%         323.000000
50%         510.000000
75%         789.000000
max       84548.000000
Name: duration_sec, dtype: float64
df['duration_sec'].sort_values(ascending=False).head(100)
85465 84548
127999 83519
112435 83407
5203 83195
95750 82512
...
69132 37276
50172 36586
3 36490
779 36190
118336 35855
Name: duration_sec, Length: 100, dtype: int64
len(df[df['duration_sec']>2000])
4692
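To judge how much data a cutoff discards, it can help to report the share of trips above it alongside the raw count. The helper below is a small sketch; `share_above` is a hypothetical name, and the durations are made-up values rather than rows from the dataset.

```python
import pandas as pd

def share_above(values, cutoff):
    """Return (count, percent) of entries strictly greater than cutoff."""
    values = pd.Series(values)
    count = int((values > cutoff).sum())
    percent = 100 * count / len(values)
    return count, percent

# Toy durations in seconds (made-up values, not taken from the dataset)
toy_durations = [300, 510, 789, 1500, 2500, 84548]
count, percent = share_above(toy_durations, cutoff=2000)
print(count, round(percent, 1))
```

With the figures reported here, 4692 of 174952 trips is about 2.7% of the data.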
df.loc[df['duration_sec'].sort_values(ascending=False).head(20).index]
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | distance | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 85465 | 84548 | 2019-02-16 15:48:25.0290 | 2019-02-17 15:17:33.0800 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 368.0 | Myrtle St at Polk St | 37.785434 | -122.419622 | 6301 | Subscriber | 1981.0 | Male | No | 1.297555 | 43 |
| 127999 | 83519 | 2019-02-09 15:16:17.5370 | 2019-02-10 14:28:17.2700 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 43.0 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 5561 | Customer | 1990.0 | Male | No | 1.872044 | 34 |
| 112435 | 83407 | 2019-02-11 16:25:33.0690 | 2019-02-12 15:35:40.9560 | 77.0 | 11th St at Natoma St | 37.773507 | -122.416040 | 344.0 | 16th St Depot | 37.766349 | -122.396292 | 1842 | Customer | 1988.0 | Male | No | 1.909622 | 36 |
| 5203 | 83195 | 2019-02-27 14:47:23.1810 | 2019-02-28 13:53:58.4330 | 243.0 | Bancroft Way at College Ave | 37.869360 | -122.254337 | 248.0 | Telegraph Ave at Ashby Ave | 37.855956 | -122.259795 | 5781 | Subscriber | 1962.0 | Female | Yes | 1.565618 | 62 |
| 95750 | 82512 | 2019-02-14 13:56:21.7280 | 2019-02-15 12:51:34.3150 | 368.0 | Myrtle St at Polk St | 37.785434 | -122.419622 | 44.0 | Civic Center/UN Plaza BART Station (Market St ... | 37.781074 | -122.411738 | 6152 | Customer | 1998.0 | Other | No | 0.845597 | 26 |
| 8631 | 81549 | 2019-02-27 09:41:38.5520 | 2019-02-28 08:20:48.3860 | 138.0 | Jersey St at Church St | 37.750900 | -122.427411 | 140.0 | Cesar Chavez St at Dolores St | 37.747858 | -122.424986 | 2266 | Subscriber | 1963.0 | Female | No | 0.399848 | 61 |
| 107581 | 79548 | 2019-02-12 17:45:50.5360 | 2019-02-13 15:51:38.8590 | 79.0 | 7th St at Brannan St | 37.773492 | -122.403672 | 66.0 | 3rd St at Townsend St | 37.778742 | -122.392741 | 1718 | Customer | 1995.0 | Female | No | 1.124212 | 29 |
| 90195 | 74408 | 2019-02-15 16:54:01.0600 | 2019-02-16 13:34:09.3670 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 4714 | Subscriber | 1988.0 | Male | No | 2.704545 | 36 |
| 86454 | 74097 | 2019-02-16 16:20:41.4650 | 2019-02-17 12:55:38.4670 | 99.0 | Folsom St at 15th St | 37.767037 | -122.415442 | 139.0 | Garfield Square (25th St at Harrison St) | 37.751017 | -122.411901 | 6235 | Subscriber | 1980.0 | Male | No | 1.808368 | 44 |
| 123383 | 73930 | 2019-02-10 13:03:36.4040 | 2019-02-11 09:35:46.4460 | 270.0 | Ninth St at Heinz Ave | 37.853489 | -122.289415 | 270.0 | Ninth St at Heinz Ave | 37.853489 | -122.289415 | 1333 | Subscriber | 1989.0 | Female | No | 0.000000 | 35 |
| 129176 | 72627 | 2019-02-09 15:15:59.2380 | 2019-02-10 11:26:26.3300 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 4641 | Customer | 1990.0 | Male | No | 0.000000 | 34 |
| 116671 | 72590 | 2019-02-11 11:26:36.9850 | 2019-02-12 07:36:27.3610 | 39.0 | Scott St at Golden Gate Ave | 37.778999 | -122.436861 | 52.0 | McAllister St at Baker St | 37.777416 | -122.441838 | 764 | Subscriber | 1978.0 | Male | No | 0.471516 | 46 |
| 129177 | 72576 | 2019-02-09 15:16:26.2830 | 2019-02-10 11:26:02.5440 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 4964 | Customer | 1990.0 | Male | No | 0.000000 | 34 |
| 145977 | 71470 | 2019-02-06 13:23:11.3570 | 2019-02-07 09:14:21.3660 | 368.0 | Myrtle St at Polk St | 37.785434 | -122.419622 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 4993 | Customer | 1998.0 | Other | No | 1.717457 | 26 |
| 29922 | 70925 | 2019-02-24 07:08:31.2700 | 2019-02-25 02:50:36.5900 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 71.0 | Broderick St at Oak St | 37.773063 | -122.439078 | 5282 | Subscriber | 1989.0 | Other | No | 0.685363 | 35 |
| 14381 | 70211 | 2019-02-26 17:08:16.8970 | 2019-02-27 12:38:28.4360 | 80.0 | Townsend St at 5th St | 37.775235 | -122.397437 | 58.0 | Market St at 10th St | 37.776619 | -122.417385 | 5373 | Subscriber | 1990.0 | Male | No | 1.759961 | 34 |
| 32098 | 69980 | 2019-02-23 19:52:25.3350 | 2019-02-24 15:18:46.0720 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 369.0 | Hyde St at Post St | 37.787349 | -122.416651 | 2034 | Customer | 1993.0 | Male | No | 1.414775 | 31 |
| 54376 | 69803 | 2019-02-20 17:46:28.4200 | 2019-02-21 13:09:52.1210 | 5.0 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 344.0 | 16th St Depot | 37.766349 | -122.396292 | 4760 | Customer | 1989.0 | Female | No | 2.224749 | 35 |
| 33431 | 69620 | 2019-02-23 16:33:41.5800 | 2019-02-24 11:54:02.4080 | 90.0 | Townsend St at 7th St | 37.771058 | -122.402717 | 321.0 | 5th St at Folsom | 37.780146 | -122.403071 | 3656 | Customer | 1993.0 | Male | No | 1.010985 | 31 |
| 120711 | 69335 | 2019-02-10 21:37:01.9930 | 2019-02-11 16:52:37.3450 | 321.0 | 5th St at Folsom | 37.780146 | -122.403071 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 4637 | Customer | 1996.0 | Male | No | 0.769210 | 28 |
sns.histplot(df, x='duration_sec');
bins = np.arange(np.around(df['duration_sec'].min()),2000+50, 50)
sns.histplot(df, x='duration_sec',bins=bins,kde=True, kde_kws=dict(clip=(bins.min(), bins.max())));
Number of bike rentals per gender using member_gender
plt.figure(figsize=(5,5))
# Use one ordering for both the bars and the percentage labels so they line up
counts = df['member_gender'].value_counts()
sns.countplot(df, x='member_gender', order=counts.index);
total_count = counts.sum()
for i, count in enumerate(counts):
    text_annotation = f'{100*count/total_count:.1f}%'
    plt.text(i, count, text_annotation, fontsize=10, ha='center', va='bottom')
Distribution of bike renters' age using age
bins = np.arange(0,155,5)
sns.histplot(df, x='age',bins=bins);
It is highly unlikely that anyone over 80 is taking bicycle trips, so ages above 80 are treated as outliers.
df['age'].describe()
count    174952.000000
mean         39.196865
std          10.118731
min          23.000000
25%          32.000000
50%          37.000000
75%          44.000000
max         146.000000
Name: age, dtype: float64
bins = np.arange(0,85,5)
sns.histplot(df, x='age',bins=bins);
Distribution of distance travelled by renters using distance
sns.histplot(df, x='distance');
df['distance'].sort_values(ascending=False)
112038 69.469241
19827 15.673955
50859 14.099709
153112 13.894462
87602 13.590843
...
99192 0.000000
169177 0.000000
82988 0.000000
99150 0.000000
77702 0.000000
Name: distance, Length: 174952, dtype: float64
There are some outliers in the distance column: the longest trip is recorded at 69.47 km, while the next longest is 15.67 km.
bins = np.arange(np.around(df['distance'].min()),10, 0.25)
sns.histplot(df, x='distance',bins=bins,kde=True, kde_kws=dict(clip=(bins.min(), bins.max())));
len(df[df['distance']>10])
9
Some trips have a distance of zero, meaning the bike was returned to the station where it was rented. Either the bike was not ridden or the rider made a round trip.
df[(df['start_station_latitude'] == df['end_station_latitude']) & ( df['start_station_longitude'] == df['end_station_longitude'] ) ]
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | distance | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 874 | 2019-02-28 23:43:05.1830 | 2019-02-28 23:57:39.7960 | 180.0 | Telegraph Ave at 23rd St | 37.812678 | -122.268773 | 180.0 | Telegraph Ave at 23rd St | 37.812678 | -122.268773 | 5629 | Customer | 1978.0 | Male | No | 0.0 | 46 |
| 27 | 408 | 2019-02-28 23:48:08.2820 | 2019-02-28 23:54:56.9300 | 78.0 | Folsom St at 9th St | 37.773717 | -122.411647 | 78.0 | Folsom St at 9th St | 37.773717 | -122.411647 | 5410 | Subscriber | 1982.0 | Male | No | 0.0 | 42 |
| 34 | 471 | 2019-02-28 23:42:43.3610 | 2019-02-28 23:50:34.4460 | 133.0 | Valencia St at 22nd St | 37.755213 | -122.420975 | 133.0 | Valencia St at 22nd St | 37.755213 | -122.420975 | 5559 | Subscriber | 1992.0 | Male | No | 0.0 | 32 |
| 55 | 3478 | 2019-02-28 22:39:35.0200 | 2019-02-28 23:37:33.3420 | 11.0 | Davis St at Jackson St | 37.797280 | -122.398436 | 11.0 | Davis St at Jackson St | 37.797280 | -122.398436 | 1846 | Subscriber | 1995.0 | Male | No | 0.0 | 29 |
| 56 | 3140 | 2019-02-28 22:44:53.5030 | 2019-02-28 23:37:14.0900 | 11.0 | Davis St at Jackson St | 37.797280 | -122.398436 | 11.0 | Davis St at Jackson St | 37.797280 | -122.398436 | 3040 | Subscriber | 1983.0 | Female | No | 0.0 | 41 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183317 | 1476 | 2019-02-01 02:45:04.7440 | 2019-02-01 03:09:41.1840 | 345.0 | Hubbell St at 16th St | 37.766483 | -122.398279 | 345.0 | Hubbell St at 16th St | 37.766483 | -122.398279 | 5224 | Subscriber | 1967.0 | Male | No | 0.0 | 57 |
| 183318 | 877 | 2019-02-01 02:53:15.9950 | 2019-02-01 03:07:53.0580 | 385.0 | Woolsey St at Sacramento St | 37.850578 | -122.278175 | 385.0 | Woolsey St at Sacramento St | 37.850578 | -122.278175 | 4913 | Subscriber | 1987.0 | Male | No | 0.0 | 37 |
| 183326 | 5713 | 2019-02-01 01:02:55.1680 | 2019-02-01 02:38:09.0020 | 31.0 | Raymond Kimbell Playground | 37.783813 | -122.434559 | 31.0 | Raymond Kimbell Playground | 37.783813 | -122.434559 | 5366 | Subscriber | 1972.0 | Male | No | 0.0 | 52 |
| 183350 | 874 | 2019-02-01 01:41:43.4140 | 2019-02-01 01:56:17.5520 | 253.0 | Haste St at College Ave | 37.866418 | -122.253799 | 253.0 | Haste St at College Ave | 37.866418 | -122.253799 | 3232 | Subscriber | 1995.0 | Male | Yes | 0.0 | 29 |
| 183380 | 943 | 2019-02-01 00:43:11.5500 | 2019-02-01 00:58:55.2170 | 31.0 | Raymond Kimbell Playground | 37.783813 | -122.434559 | 31.0 | Raymond Kimbell Playground | 37.783813 | -122.434559 | 5343 | Subscriber | 1972.0 | Male | No | 0.0 | 52 |
3458 rows × 18 columns
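Since the distance is computed from station coordinates, a zero can be cross-checked against the station ids, and the trip duration gives a hint about whether the bike was actually ridden. The sketch below uses a made-up three-row frame (an assumption for illustration) with the same column names as the dataset.

```python
import pandas as pd

# Toy trips (made-up values): the second one starts and ends at station 72
toy = pd.DataFrame({
    "start_station_id": [21.0, 72.0, 3.0],
    "end_station_id": [13.0, 72.0, 86.0],
    "distance": [0.54, 0.0, 2.70],
    "duration_sec": [600, 72627, 900],
})

same_station = toy["start_station_id"] == toy["end_station_id"]
zero_distance = toy["distance"] == 0

# Zero-distance trips that do NOT share a station id would point to a data issue
mismatches = int((zero_distance & ~same_station).sum())
print(mismatches)

# Durations of same-station trips hint at whether the bike was out for a while
print(toy.loc[same_station, "duration_sec"].describe())
```

Running the same comparison on the full DataFrame would show whether all 3458 zero-distance rows are same-station trips.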
Cleaning based on distributions
Cleaning distance
len(df[df['distance'] > 10])
9
df = df[df['distance'] <= 10]
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174943 entries, 0 to 183411 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 174943 non-null int64 1 start_time 174943 non-null object 2 end_time 174943 non-null object 3 start_station_id 174943 non-null float64 4 start_station_name 174943 non-null object 5 start_station_latitude 174943 non-null float64 6 start_station_longitude 174943 non-null float64 7 end_station_id 174943 non-null float64 8 end_station_name 174943 non-null object 9 end_station_latitude 174943 non-null float64 10 end_station_longitude 174943 non-null float64 11 bike_id 174943 non-null int64 12 user_type 174943 non-null object 13 member_birth_year 174943 non-null float64 14 member_gender 174943 non-null object 15 bike_share_for_all_trip 174943 non-null object 16 distance 174943 non-null float64 17 age 174943 non-null int32 dtypes: float64(8), int32(1), int64(2), object(7) memory usage: 24.7+ MB
Cleaning age
len(df[df['age'] > 80])
263
df = df[df['age'] <= 80]
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174680 entries, 0 to 183411 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 174680 non-null int64 1 start_time 174680 non-null object 2 end_time 174680 non-null object 3 start_station_id 174680 non-null float64 4 start_station_name 174680 non-null object 5 start_station_latitude 174680 non-null float64 6 start_station_longitude 174680 non-null float64 7 end_station_id 174680 non-null float64 8 end_station_name 174680 non-null object 9 end_station_latitude 174680 non-null float64 10 end_station_longitude 174680 non-null float64 11 bike_id 174680 non-null int64 12 user_type 174680 non-null object 13 member_birth_year 174680 non-null float64 14 member_gender 174680 non-null object 15 bike_share_for_all_trip 174680 non-null object 16 distance 174680 non-null float64 17 age 174680 non-null int32 dtypes: float64(8), int32(1), int64(2), object(7) memory usage: 24.7+ MB
Cleaning duration_sec
len(df[df['duration_sec']>2000])
4681
df = df[df['duration_sec'] <= 2000]
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 169999 entries, 4 to 183411 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 duration_sec 169999 non-null int64 1 start_time 169999 non-null object 2 end_time 169999 non-null object 3 start_station_id 169999 non-null float64 4 start_station_name 169999 non-null object 5 start_station_latitude 169999 non-null float64 6 start_station_longitude 169999 non-null float64 7 end_station_id 169999 non-null float64 8 end_station_name 169999 non-null object 9 end_station_latitude 169999 non-null float64 10 end_station_longitude 169999 non-null float64 11 bike_id 169999 non-null int64 12 user_type 169999 non-null object 13 member_birth_year 169999 non-null float64 14 member_gender 169999 non-null object 15 bike_share_for_all_trip 169999 non-null object 16 distance 169999 non-null float64 17 age 169999 non-null int32 dtypes: float64(8), int32(1), int64(2), object(7) memory usage: 24.0+ MB
Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?
The bike_id distribution graph indicated that most trips used bikes whose ids are in the 5000 range. The user_type count plot indicated that most bike renters are subscribers rather than one-time customers: 91.2% are subscribers and 8.8% are one-time customers. The duration_sec distribution graph indicated that most bike rides lasted 200 to 800 seconds. The member_gender count plot shows that most bike renters are male: 74.8% were male, 23.2% were female and 2% identified as other. The age distribution graph showed that most bike renters were between 25 and 45 years old. The distance distribution graph showed that most bike trips covered between 0.25 and 2.75 km.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There were some bike riders above the age of 80, which seemed unlikely, so they were considered outliers. Distances above 10 km also appeared to be outliers. There were 3458 trips with a distance of 0 km, which indicates that the bike was returned to its starting station, either unused or after a round trip. Durations tended to fall between 100 and 2000 seconds, and values above 2000 seconds were considered outliers.
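The three cleaning steps can also be expressed as a single boolean mask, which makes the cutoffs easy to adjust in one place. This is a minimal sketch; `remove_outliers` is a hypothetical helper, and the toy rows are made-up values chosen so that each filter removes one row.

```python
import pandas as pd

def remove_outliers(trips, max_distance_km=10, max_age=80, max_duration_sec=2000):
    """Keep only trips within the three cutoffs used in this analysis."""
    mask = (
        (trips["distance"] <= max_distance_km)
        & (trips["age"] <= max_age)
        & (trips["duration_sec"] <= max_duration_sec)
    )
    # .copy() avoids chained-assignment warnings if columns are added later
    return trips.loc[mask].copy()

# Toy rows (made-up values): only the first passes all three filters
toy = pd.DataFrame({
    "distance": [1.2, 69.5, 0.8, 2.0],
    "age": [35, 40, 146, 29],
    "duration_sec": [510, 700, 400, 84548],
})
cleaned = remove_outliers(toy)
print(len(toy) - len(cleaned), "rows removed")
```

In the notebook, the three filters removed 9, 263 and 4681 rows respectively, 4953 in total, leaving 169999 trips.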
Bivariate Exploration
Bar plots
indexes = df['start_station_id'].value_counts().head(10).index
indexes
Index([58.0, 67.0, 81.0, 21.0, 30.0, 3.0, 15.0, 22.0, 16.0, 5.0], dtype='float64', name='start_station_id')
sns.barplot(df[df['start_station_id'].isin(indexes)], x='start_station_id', y='distance');
These are the ten start stations with the most bike rentals. The average trip distance does not appear to be related to how busy a station is.
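The bar plot comparison can be backed by numbers by aggregating the trip count and mean distance per station. The sketch below uses a made-up frame (an assumption for illustration) with the dataset's column names; `per_station`, `trips` and `mean_km` are hypothetical names.

```python
import pandas as pd

# Toy trips (made-up values) standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "start_station_id": [58.0, 58.0, 58.0, 67.0, 67.0, 81.0],
    "distance": [1.0, 2.0, 3.0, 1.5, 2.5, 4.0],
})

# Trip count and mean distance per start station, busiest stations first
per_station = (
    toy.groupby("start_station_id")["distance"]
       .agg(trips="count", mean_km="mean")
       .sort_values("trips", ascending=False)
       .head(10)
)
print(per_station)

# Correlation between how busy a station is and its mean trip distance
print(per_station["trips"].corr(per_station["mean_km"]))
```

With only ten stations, any correlation computed this way is a rough indication at most.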
indexes = df['end_station_id'].value_counts().head(10).index
indexes
Index([67.0, 58.0, 21.0, 15.0, 30.0, 3.0, 81.0, 16.0, 6.0, 5.0], dtype='float64', name='end_station_id')
sns.barplot(df[df['end_station_id'].isin(indexes)], x='end_station_id',y='distance');
These are the ten end stations with the most bike rentals. As with the start stations, the average trip distance does not appear to be related to how busy a station is.
Box Plots
member_gender box plots
sns.boxplot(df,x='member_gender',y='age');
The box plot shows only slight differences in age between gender categories; females tend to be a little younger than males and others.
sns.boxplot(df,x='member_gender',y='distance');
Gender does not appear to affect the distance travelled; all genders seem to travel about the same distance on average.
sns.boxplot(df,x='member_gender',y='duration_sec');
Females tend to rent bikes for longer than males and others.
user_type box plots
sns.boxplot(df,x='user_type',y='duration_sec');
Customers tend to rent bikes for a lot longer than subscribers.
sns.boxplot(df,x='user_type',y='distance');
Customers tend to travel slightly further than subscribers.
sns.boxplot(df,x='user_type',y='age');
Subscribers are slightly older than customers.
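The differences read off the box plots can be checked numerically by comparing group medians, which correspond to the center line of each box. The frame below is made up for illustration (an assumption), using the dataset's column names.

```python
import pandas as pd

# Toy riders (made-up values) with the same columns as the cleaned DataFrame
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [900, 1100, 400, 500, 600],
    "distance": [2.0, 2.4, 1.5, 1.7, 1.9],
    "age": [30, 34, 36, 38, 40],
})

# One row per user type, one column per numeric feature
medians = toy.groupby("user_type")[["duration_sec", "distance", "age"]].median()
print(medians)
```

Replacing `user_type` with `member_gender` would give the matching table for the gender box plots.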
Scatter plots
sns.scatterplot(df, x='age', y='distance',s=1,alpha=0.5);
There doesn't seem to be a relationship between bike renters' ages and the distances travelled.
sns.scatterplot(df, x='age', y='duration_sec',s=1,alpha=0.5);
There doesn't seem to be a correlation between age and rental duration.
sns.scatterplot(df, x='duration_sec', y='distance',s=1,alpha=0.35);
There seems to be a positive correlation between distance and duration: the greater the distance travelled, the longer the bike was rented.
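A correlation matrix puts a number on what the scatter plots suggest. The sketch below runs on synthetic data, generated under the assumption that duration grows with distance plus noise while age is unrelated, so the values it prints illustrate the method and are not results from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (assumption): duration depends on distance, age does not
distance = rng.uniform(0.2, 5.0, size=500)
toy = pd.DataFrame({
    "distance": distance,
    "duration_sec": 250 * distance + rng.normal(0, 100, size=500),
    "age": rng.integers(23, 80, size=500),
})

# Pearson correlation between every pair of the three numeric features
corr = toy[["age", "duration_sec", "distance"]].corr()
print(corr.round(2))
```

The same call on the cleaned DataFrame, `df[['age', 'duration_sec', 'distance']].corr()`, would give the coefficients for the actual trips.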
Grouped bar chart
sns.countplot(df, x='user_type',hue='member_gender');
Most bike renters are subscribers. Among both subscribers and customers, males are the most frequent, followed by females and then others.
Plot Matrix
plt.figure(figsize=(15,10))
plt.subplot(2,3,1)
sns.boxplot(df,x='member_gender',y='age');
plt.subplot(2,3,2)
sns.boxplot(df,x='member_gender',y='distance');
plt.subplot(2,3,3)
sns.boxplot(df,x='member_gender',y='duration_sec');
plt.subplot(2,3,4)
sns.boxplot(df,x='user_type',y='distance');
plt.subplot(2,3,5)
sns.boxplot(df,x='user_type',y='age');
plt.subplot(2,3,6)
sns.boxplot(df,x='user_type',y='duration_sec');
This plot matrix shows the previous box plots side by side in a succinct, horizontally oriented layout.
plt.figure(figsize=(10,15))
plt.subplot(3,2,1)
sns.boxplot(df,x='member_gender',y='age');
plt.subplot(3,2,2)
sns.boxplot(df,x='member_gender',y='distance');
plt.subplot(3,2,3)
sns.boxplot(df,x='member_gender',y='duration_sec');
plt.subplot(3,2,4)
sns.boxplot(df,x='user_type',y='distance');
plt.subplot(3,2,5)
sns.boxplot(df,x='user_type',y='age');
plt.subplot(3,2,6)
sns.boxplot(df,x='user_type',y='duration_sec');
This plot matrix shows the previous box plots side by side in a succinct, vertically oriented layout.
sns.pairplot(df, x_vars=['age','duration_sec','distance'], y_vars=['age','duration_sec','distance'] ,diag_kind='hist',plot_kws={"s": 1, "alpha" : 0.35});
This plot matrix shows the earlier scatter plots in a succinct manner and adds versions with the x and y axes swapped, to try to find a correlation that way.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶
Most of the variables followed trends seen in the univariate analysis. For example, most bike renters were male, followed by female, so the grouped bar chart showed that most subscribers were male, followed by female, and the same held for customers. Most variable pairs in the scatter plots showed no correlation, except for duration and distance, where bikes that travelled longer distances tended to be rented for longer periods. Most box plots did not show tightly related variables, but they did show that customers tended to rent bikes for longer than subscribers, which may explain the slightly greater distance travelled by customers compared with subscribers. The box plots also showed that women tended to rent bikes for slightly longer.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶
Though they are not features of interest, few trips between start_station_id and end_station_id pairs covered more than 2 km, with some exceptions.
Multivariate Exploration¶
Facet Plots¶
g = sns.FacetGrid(df, col="member_gender", row="user_type", height=3, aspect=1.33);
g.map(sns.scatterplot, "duration_sec", "distance", s=1,alpha=0.45);
This facet grid shows that neither gender nor user_type affects the previously discovered correlation between duration and distance.
g = sns.FacetGrid(df, col="member_gender", row="user_type", height=3, aspect=1.33);
g.map(sns.scatterplot, "duration_sec", "age", s=5,alpha=0.45);
This facet grid shows that neither gender nor user_type affects the previously discovered non-correlation between duration and age.
g = sns.FacetGrid(df, col="member_gender", row="user_type", height=3, aspect=1.33);
g.map(sns.scatterplot, "age", "distance", s=5,alpha=0.45);
This facet grid shows that neither gender nor user_type affects the previously discovered non-correlation between distance and age.
Grouped bar plots with three variables¶
sns.barplot(df, x='user_type',y='distance',hue='member_gender');
Customers tend to travel farther than subscribers on average, with the Other gender category travelling the farthest.
sns.barplot(df, x='user_type',y='age',hue='member_gender');
The ages of subscribers and customers are roughly the same across genders.
sns.barplot(df, x='user_type',y='duration_sec',hue='member_gender');
Customers tend to rent bikes for a lot longer than subscribers. Females rent bikes the longest among both subscribers and customers.
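By default `sns.barplot` draws the mean of the y variable for each group, so the bar heights can be reproduced with a grouped mean. A minimal sketch on made-up rows follows; `sample` and its values are illustrative assumptions.

```python
import pandas as pd

# Synthetic stand-in rows (assumption), not values from the dataset.
sample = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "member_gender": ["Female", "Male", "Male", "Female", "Male", "Male"],
    "duration_sec": [1500, 1000, 1200, 700, 500, 600],
})

# Mean duration per (user_type, member_gender), reshaped so that
# rows are user types and columns are genders, like the bar groups.
means = (
    sample.groupby(["user_type", "member_gender"])["duration_sec"]
    .mean()
    .unstack("member_gender")
)
print(means)
```

Reading the table row by row gives the same comparison as the hue-grouped bars, without the visual guesswork of judging bar heights.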
Scatter Plot with Multiple Encodings¶
sns.scatterplot(df, x='duration_sec', y='age',hue="member_gender",s=10,alpha=0.35);
leg = plt.legend(title="Gender")
for lh in leg.legend_handles:
    lh.set_alpha(1)
    lh.set_markersize(8)
Gender does not reveal any additional relationship between duration and age.
sns.scatterplot(df, x='duration_sec', y='distance',hue="member_gender",s=10,alpha=0.35);
leg = plt.legend(title="Gender")
for lh in leg.legend_handles:
    lh.set_alpha(1)
    lh.set_markersize(8)
Gender does not add to the previously mentioned correlation between distance and duration.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶
Most features tended to follow trends already established in the bivariate and univariate analyses. The grouped bar chart of user_type, gender, and duration_sec showed that customers tended to rent bikes the longest and that, among those customers, females rented bikes the longest. The scatter plots with multiple encodings showed that gender neither changed nor added a trend to the relationships found in the bivariate analysis.
Were there any interesting or surprising interactions between features?¶
Nope.
Conclusions¶
Two columns were added to the dataset: age, calculated by subtracting the birth year from the current year, and distance, calculated with the distance formula.
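As an illustration of how two such columns could be derived, here is a minimal sketch. It rests on assumptions: the reference year is fixed at 2019 (the year of the trip data) rather than read from the clock, and the great-circle haversine formula stands in for the distance formula, which may differ from the one used in this notebook. The helper `haversine_km` is hypothetical; the coordinates are the two rows shown by `df.head()`.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2019     # assumption: the year the trips were recorded
EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# The two rows displayed by df.head() earlier in the notebook.
sample = pd.DataFrame({
    "member_birth_year": [1984.0, 1972.0],
    "start_station_latitude": [37.789625, 37.769305],
    "start_station_longitude": [-122.400811, -122.426826],
    "end_station_latitude": [37.794231, 37.786375],
    "end_station_longitude": [-122.402923, -122.404904],
})

# Age in years relative to the reference year.
sample["age"] = REFERENCE_YEAR - sample["member_birth_year"]

# Straight-line (not route) distance between start and end stations, in km.
sample["distance"] = haversine_km(
    sample["start_station_latitude"], sample["start_station_longitude"],
    sample["end_station_latitude"], sample["end_station_longitude"],
)
print(sample[["age", "distance"]])
```

Whatever formula is used, the result is the station-to-station separation, not the path actually ridden, so a round trip back to the same station has a distance of zero regardless of its duration.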
It was found that there were more male renters than renters of any other gender, yet female renters tended to rent bikes for longer than the others. Distances travelled tended to be less than 4 km. Age did not correlate with either distance or duration. However, there was a positive relationship between distance and duration: the farther a bike was ridden, the longer it was rented.
Exporting Cleaned Dataset¶
df.to_csv('cleaned.csv')
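Because `to_csv` is called here without `index=False`, the row index is written as an unnamed first column. The sketch below shows one way to read such a file back so that the index is restored rather than appearing as an extra column; it uses an in-memory buffer and made-up rows as assumptions in place of `cleaned.csv`.

```python
import io
import pandas as pd

# Stand-in frame (assumption) with a non-contiguous index, as left by dropna().
sample = pd.DataFrame(
    {"duration_sec": [52185, 61854], "user_type": ["Customer", "Customer"]},
    index=[0, 2],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the restored frame would gain a column named `Unnamed: 0` holding the old index values.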